Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #274 - Views Incorrectly Encode UTF Characters as HTML Entities #403

Open
wants to merge 1 commit into
base: hotfix
Choose a base branch
from

Conversation

GoryMoon
Copy link

@GoryMoon GoryMoon commented Nov 26, 2023

Description

See the issue #274 for more details about the issue.

I identified that when the incorrectly encoded characters showed up the to_html function didn't have the $toHTML global array so instead used the htmlentities function causing the html encodings.

if (is_array($toHTML)) {
$string = str_ireplace($GLOBALS['toHTML_keys'], $GLOBALS['toHTML_values'] ?? [], $string);
} else {
$string = htmlentities($string, ENT_HTML401|ENT_QUOTES, 'UTF-8');

With the small change of defining the variable in the $GLOBALS array, the function will use that and encode it correctly.

There might be more related to encoding issues but this fixes my issues and what it looks like the issue is having.

See this comment for an update on the changes to this: #403 (comment)

Motivation and Context

How To Test This

  1. Change the name of something that is referenced.
    a. For example, change the company name that is referenced in a Call. (Like adding åäö or other letters)
  2. Got the the list of that data and you should see the HTML encoded letters

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Final checklist

  • My code follows the code style of this project found here.
  • My change requires a change to the documentation.
  • I have read the How to Contribute guidelines.

@serhiisamko091184
Copy link
Contributor

Hello @GoryMoon,
thank you very much for your contribution!

Would you be so kind to check with me replication steps:

  1. Initially I've created an account:

image

  1. After that I created a call to make a relationship with the account:
    image

  2. I've applied the current fix and the letters "åäö" are encoded properly. At the same time single quotes are rendered without appropriate encoding:
    image

Is this something you could have a look at if you have time (characters ', '', <, >)?

Thanks in advance!

Regards,
Serhii

@serhiisamko091184 serhiisamko091184 added the Status:Requires Updates Issues & PRs which requires input or update from the author label Nov 27, 2023
@pgorod
Copy link
Contributor

pgorod commented Nov 27, 2023

I don't know if anybody read my comment in the Issue...

I'd really like to understand this better. The question "how is this in the Database" is an important one. If there's HTML encoding in the database, it's wrong. If we don't put it there in the first-place, it's better.

I also don't think that we should be re-implementing basic stuff that PHP is supposed to be able to handle correctly, like string substitutions to encode HTML. Otherwise, now you fixed four cases, then @serhiisamko091184 asks you for 4 more, and we can go on indefinitely but we should just use complete functions from the existing libraries.

@GoryMoon
Copy link
Author

I have done some debugging to get a better understanding of when the encoding issues occur and why it works in some places. This is just a collection of stuff I found related to this issue.
I did this testing while forcing the to_html() to use htmlentities() instead of the string replace one.

 

I'd really like to understand this better. The question "how is this in the Database" is an important one. If there's HTML encoding in the database, it's wrong. If we don't put it there in the first-place, it's better.

From testing it looks like it saves it like normal and not HTML encoded, when saving in cleans up the input and removes some stuff if it's not allowed. When trying to insert <script>alert(1);</script> in a company name it will remove the whole script.
After manually inserting the above script into the database it will show the text but don't execute it. As Angular just treats it as text when inserting it into the DOM, to make it run it needs to be like a html-field and not like for example like the relate-field currently is.

The encoding error happens on pages where the GraphQL returns the data HTML encoded and angular handles the data. For pages when it works, it's either not encoded when angular receives it or it's a legacy iframe that contains the HTML-encoded values directly.

As an example: on the Account page the records are received from the RecordHandler which tells the Bean not to encode the data:

$bean = BeanFactory::getBean($moduleName, $id, ['encode' => false]);

Most issues where you have a Related To field exist because the underlying list fetching data encodes the parent data, and the normal record data is decoded.

As an example this is from the Calls list the name of the record is correct but the Related To is encoded
Screenshot_20231128_000944
The record is correct because it goes through this function and maps the fields but not the parent_name field containing the related name.

$arr[$name] = html_entity_decode($bean->$field ?? '', ENT_QUOTES);

Another place would be the timeline where it could be called to not encode the data from the backend code like the Account records.
Screenshot_20231128_001409

A encode parameter could be added to reach the fetchAll functions until you get to the SubpanelCustomQueryPort that runs the fetchByAssoc function that takes the encode parameter.

 

I also don't think that we should be re-implementing basic stuff that PHP is supposed to be able to handle correctly, like string substitutions to encode HTML. Otherwise, now you have fixed four cases, then serhiisamko091184 asks you for 4 more, and we can go on indefinitely but we should just use complete functions from the existing libraries.

I agree with using existing functions rather than reimplementing them.
I didn't know how much would be needed to fix this but when I found that the function didn't work as intended this could ba e small fix the solve part of the annoyance of encoding utf8 characters.

@SuiteBot
Copy link

SuiteBot commented Aug 27, 2024

CLA assistant check
All committers have signed the CLA.

@GoryMoon
Copy link
Author

I've looked over this once again after discovering that a change in version 8.6.1 caused descriptions and other fields to have the wrong encoding.
This was the change that caused it:
f483bec#diff-033d0d7079b9ca4c1ff8658dc33517d717937747e7ee218622dae6abb0173273R60-R65

I've tested adding the string that @yunusyerli1 was using in #501 (Şebnem Çakğır šččřž 안녕하세요감사합니다ΚαλημέραΕυχαριστώสวัสดีनमस्ते) to a bunch of places. Name, description, address and more.

Compared to the previous change this is a bit bigger.
The main cause for the encoding issue was the use of htmlentities, by replacing it with htmlspecialchars only characters that have a special significance in HTML is encoded leaving UTF-8 characters.
I also removed the "re-implemented" encoding/decoding that @pgorod pointed out and I agree with.

There a more htmlentities in the code but I haven't looked up how they get called as this change is solves the locations I've noticed.

@GoryMoon
Copy link
Author

Even after the previous changes quotes " ' and < > was encoded, by adding the html_entity_decode to the TextMapper this is fixed.

In some specific cases it double encoded some characters so that's why there are two html_entity_decode.

This mapper is used for the api and text fields so html code shouldn't be a problem. I've tested with injecting this <script>alert(1);</script> in the database (it's cleaned by the purify) and directly in the code and nothing is executed.

This change resolves #510

@pgorod
Copy link
Contributor

pgorod commented Aug 30, 2024

If there's double-encoding, shouldn't it be fixed by preventing the second encoding, instead of adding a second decode?

I'm very concerned with the way that the craziness of v7 misguided/excessive purification is creeping into v8 too... we really really really must not just patch "undo" code on top of "let's mess things up" code. Twice.

This is not a criticism of the current PR, it is a wider issue... still, we shouldn't merge just because it solves the current problem, we should really check that we're doing things the right way.

@chris001
Copy link
Contributor

chris001 commented Sep 1, 2024

The code paths could be traced, and places where double encoding happens should be eliminated, which could be difficult.

Or, the PHP string could be encapsulated inside a SuiteString class that adds methods to get the various encoded versions of the original UTF-8 string - for the web interface, etc. Problem solved.

@GoryMoon
Copy link
Author

GoryMoon commented Sep 2, 2024

Just documenting the double encoding reason, it's needed for the specific cases below.
I agree that it's not a good solution and should be prevented instead of going back and forth.

Without the double decoding the description looks like this:
firefox_uW1fTaHJ8j

Adding one decode it will look like this, the part that is double encoded looks to be non html/xml block.
firefox_euvJUzpXjd

And with both decodes it looks like this:
firefox_WfNVudCSSy


This issue was introduced in f483bec for the v8.6.1 release to fix what looks like a bunch of CVEs.
Not sure it there was a specific case for the TextMapper or it if was applied to anything that might cause it. That makes me a bit hesitant to change too much without knowing the reason.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants